A resource-saving collective approach to biomedical semantic role labeling
BACKGROUND: Biomedical semantic role labeling (BioSRL) is a natural language processing technique that identifies the semantic roles of the words or phrases in sentences describing biological processes and expresses them as predicate-argument structures (PASs). Currently, a major problem in BioSRL is that most systems label every node in a full parse tree independently, even though some nodes always exhibit dependency. In general SRL, collective approaches based on the Markov logic network (MLN) model have successfully dealt with this problem. In BioSRL, however, such an approach has not been attempted, because it would require more training data to recognize the more specialized and diverse terms found in biomedical literature, increasing training time and computational complexity.

RESULTS: We first constructed a collective BioSRL system based on MLN. This system, called collective BIOSMILE (CBIOSMILE), is trained on the BioProp corpus. To reduce the resources used in BioSRL training, we employ a tree-pruning filter to remove unlikely nodes from the parse tree and four argument candidate identifiers to retain candidate nodes; nodes not recognized by any candidate identifier are discarded. The pruned annotated parse trees are used to train a resource-saving MLN-based system, referred to as resource-saving collective BIOSMILE (RCBIOSMILE). Our experimental results show that CBIOSMILE outperforms BIOSMILE, the top BioSRL system. Furthermore, RCBIOSMILE matches the accuracy of CBIOSMILE while using 92% less memory and 57% less training time.

CONCLUSIONS: This greatly improved efficiency makes RCBIOSMILE potentially suitable for training on much larger BioSRL corpora over more biomedical domains. Compared to real-world biomedical corpora, BioProp is relatively small, containing only 445 MEDLINE abstracts and 30 event triggers. It is not large enough for practical applications, such as pathway construction. We therefore consider it of primary importance to pursue SRL training on large corpora in the future.
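The resource-saving step described above (a tree-pruning filter followed by candidate identifiers, with unrecognized nodes discarded) can be sketched in outline. The node structure, filter, and identifier functions below are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of the pruning pipeline described in the abstract:
# keep a parse-tree node only if it passes the tree-pruning filter
# and is accepted by at least one argument candidate identifier.
# Node layout and identifier logic are illustrative assumptions.

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def prune(node, passes_filter, identifiers):
    """Return a pruned copy of the tree, dropping nodes that fail the
    filter or are rejected by every candidate identifier."""
    kept_children = [
        pruned for child in node.children
        if (pruned := prune(child, passes_filter, identifiers)) is not None
    ]
    keep = passes_filter(node) and any(ident(node) for ident in identifiers)
    if keep or kept_children:
        return Node(node.label, kept_children)
    return None
```

A node whose subtree is entirely rejected disappears from the training data, which is what shrinks the memory footprint of the downstream MLN training.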
Rule-based Korean Grapheme to Phoneme Conversion Using Sound Patterns
PACLIC 23 / City University of Hong Kong / 3-5 December 2009
WikiSense: Supersense Tagging of Wikipedia Named Entities Based on WordNet
PACLIC 23 / City University of Hong Kong / 3-5 December 2009
Modeling the Relationship among Linguistic Typological Features with Hierarchical Dirichlet Process
PACLIC 23 / City University of Hong Kong / 3-5 December 2009
Large Language Models on the Chessboard: A Study on ChatGPT's Formal Language Comprehension and Complex Reasoning Skills
While large language models have made strides in natural language processing, their proficiency in complex reasoning tasks that require formal language comprehension, such as chess, remains less investigated. This paper probes the performance of ChatGPT, a sophisticated language model by OpenAI, on such complex reasoning tasks, using chess as a case study. Through robust metrics examining both the legality and quality of moves, we assess ChatGPT's understanding of the chessboard, adherence to chess rules, and strategic decision-making abilities. Our evaluation identifies limitations within ChatGPT's attention mechanism that affect its formal language comprehension and uncovers the model's underdeveloped self-regulation abilities. Our study also reveals ChatGPT's propensity for a coherent strategy in its gameplay and a noticeable uptick in decision-making assertiveness when the model is presented with a greater volume of natural language or possesses a more lucid understanding of the state of the chessboard. These findings contribute to the growing exploration of language models' abilities beyond natural language processing, providing valuable information for future research towards models demonstrating human-like cognitive abilities.
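A move-legality metric of the kind described above can be sketched abstractly. The function below is a hypothetical illustration: the legality predicate is injected as a callable, whereas a real evaluation would back it with a full chess rules engine, which the abstract does not name.

```python
# Hedged sketch of a legal-move-rate metric for evaluating a model's
# chess output. The legality check is passed in as a callable so the
# metric itself stays self-contained.

def evaluate_moves(moves, is_legal):
    """Return (legal_rate, first_illegal_index) for a proposed move
    sequence; first_illegal_index is None if every move is legal."""
    legal, first_bad = 0, None
    for i, move in enumerate(moves):
        if is_legal(move):
            legal += 1
        elif first_bad is None:
            first_bad = i
    rate = legal / len(moves) if moves else 0.0
    return rate, first_bad
```

For example, with a toy predicate that rejects any move mentioning a ninth rank, `evaluate_moves(["e4", "Nf3", "Qh9", "d4"], lambda m: "9" not in m)` yields a legal rate of 0.75 with the first illegal move at index 2.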
Exploring Methods for Building Dialects-Mandarin Code-Mixing Corpora: A Case Study in Taiwanese Hokkien
In natural language processing (NLP), code-mixing (CM) is a challenging task, especially when the mixed languages include dialects. In Southeast Asian countries such as Singapore, Indonesia, and Malaysia, Hokkien-Mandarin is the most widespread code-mixed language pair among Chinese immigrants, and it is also common in Taiwan. However, dialects such as Hokkien often suffer from scarce resources and the lack of an official writing system, limiting the development of dialect CM research. In this paper, we propose a method to construct a Hokkien-Mandarin CM dataset that mitigates these limitations, addresses the morphological issues arising within the Sino-Tibetan language family, and offers an efficient Hokkien word segmentation method through a linguistics-based toolkit. Furthermore, we use our proposed dataset and employ transfer learning to train XLM (the cross-lingual language model) for translation tasks. To fit the code-mixing scenario, we adapt XLM slightly. We found that by using linguistic knowledge, rules, and language tags, the model produces good results on CM data translation while maintaining monolingual translation quality. Comment: This paper was accepted by Findings of EMNLP 2022.
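The word segmentation step mentioned above can be illustrated with a simple forward maximum-matching baseline over a language-tagged lexicon. The toy lexicon entries and tag names below are hypothetical and do not come from the paper's toolkit or corpus.

```python
# Hedged sketch: forward maximum matching over a lexicon whose
# entries carry language tags, a common baseline for segmenting
# code-mixed Chinese-character text. Lexicon contents are toy
# examples, not data from the paper.

def segment(text, lexicon, max_len=4):
    """Greedy longest-match segmentation; characters not covered by
    the lexicon become single-character tokens tagged 'UNK'."""
    out, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in lexicon:
                out.append((piece, lexicon[piece]))
                i += length
                break
        else:
            out.append((text[i], "UNK"))
            i += 1
    return out
```

The language tags attached to each token are what a downstream code-mixing model (such as the adapted XLM described above) could consume alongside the text.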
Korean-Chinese Person Name Translation for Cross Language Information Retrieval
PACLIC 21 / Seoul National University, Seoul, Korea / November 1-3, 2007